Exploring the Relationship between GDP per capita and Life Expectancy

Author

Adrian Tran, Harshitha Bachina, Hayden King, Josh Shneyder

Summary of Data

Data Description

In this final project report, we decided to explore the relationship between the inflation-adjusted GDP per capita in US dollars and the Life Expectancy at birth. We retrieved this data from Gapminder, which is a data source that combines data from multiple sources into a coherent time series. The life expectancy data set shows the average number of years a newborn child would live if current mortality rates stay the same. The data set includes data on almost every country and includes rates from 1799 to 2099. The inflation-adjusted GDP per capita in US dollars covers almost every country from 1959 to 2019, but there is some missing data in the earlier years for less developed countries. We joined these data sets together to analyze the relationship between GDP per capita and Life Expectancy between 1959 and 2019.

This project often utilizes the \(R^2\) value as a comparative measure for goodness of fit for a given model. This statistical measure determines the variation of a dependent variable that can be explained by an independent variable. A higher \(R^2\) signifies a stronger relationship between the two variables. The value of the statistic ranges between 0 and 1, with 1 signifying a perfect relationship and 0 signifying no correlation at all.

Linear Visualization

To visualize the relationship between these variables, we started with a scatterplot.

Code
ggplot(data = joined_data, aes(x = GDP, y = Life_Expectancy)) +
  geom_point(alpha = 0.1, color = "orange") +
  labs(
    x = "GDP per capita",
    y = "",
    subtitle = "Life Expectancy",
    title = "Relationship between GDP per capita and Life Expectancy"
  ) +
  geom_smooth(method = "lm")

The relationship between \(GDP\) and \(LifeExpectancy\) is approximately logarithmic, so we did a log transformation of \(GDP\). We called the new variable \(lGDP\). \[lGDP = ln(GDP)\]

Logarithmic Visualization

Code
lineargraph = ggplot(data = joined_data, aes(x = lGDP, y = Life_Expectancy)) +
  geom_point(alpha = 0.1, color = "orange") +
  labs(
    x = "Ln(GDP per capita)",
    y = "",
    subtitle = "Life Expectancy",
    title = "Relationship between Ln(GDP per capita) and Life Expectancy"
  ) +
  geom_smooth(method = "lm")
lineargraph

Code
logmodel = lm(Life_Expectancy ~ lGDP, data = joined_data)

We will fit a regression line to this transformed data. From here on out, we will examine only the relationship between \(lGDP\) and \(Life Expectancy\).

Linear Regression of Transformed Data

Assumptions of the Linear Regression Model

To fit a linear regression we need to check 5 assumptions. Some of these assumptions are violated to differing degrees, but none of them are severely violated. The high sample size makes it so that very small violations of the assumptions can be detected.

1. Linearity

As shown above, the data still shows some curvature after the log transformation.

2. Random Sampling/ No Autocorrelation

The assumption of random sampling isn’t met, but we can substitute the assumption of no autocorrelation. The data shows partial autocorrelation of about 0.2 which is relatively stable across various lags. The Durbin Watson test gives a p-value of 0 for the autocorrelation, meaning there is certainly autocorrelation in the data, so this assumption isn’t met either. The blue dotted lines represent a 95% significance level.

Code
joined_data <- joined_data |>
  mutate(residuals = logmodel$residuals)
acf(joined_data$residuals, type = "correlation")

3. No colinearity

Since there is only one explanatory variable (\(lGDP\)), there can’t be any colinearity.

4. Homoskedasticity

Homoskedasticity means that the residuals at each point on the regression line have equal variance. In other words, the fitted values cannot predict the residuals. We conducted the Breusch-Pagan test to determine if Heteroskedasticity (unequal variances of residuals across the regression line) is present in the data.

Code
bptest(logmodel)

    studentized Breusch-Pagan test

data:  logmodel
BP = 301.4, df = 1, p-value < 2.2e-16

\[BP = n \times R^2_{\hat u^2} = 301.4 \] \[9162 \times R^2_{\hat u^2} = 301.4\] \[R^2_{\hat u^2} = 0.033\] While the data isn’t homoskedastic, the Breusch-Pagan test statistic shows that only 3.3% of the variance in \(u_i\) can be explained by the variance in \(lGDP\). We can say this because \(var(u_i) = E(u^2_i) - E(u_i)^2\) and \(E(u_i)=0\). While this heteroskedasticity is very much present with a p-value of 0, that is more a function of the huge sample size than a significant amount of the variation in \(u^2\) being explained by \(lGDP\).

5. Normality of \(u\) (Residuals)

Code
ggplot(data = joined_data, aes(x = residuals)) +
  geom_histogram(fill = "steelblue") +
  labs(
    y = "",
    subtitle = "Count",
    x = "Residuals",
    title = "Histogram of Residuals"
  )

While the distribution is skewed left, it isn’t severe. This is because of the low outliers mentioned earlier. Any negative effects of this assumption not being met should be made up for by the high sample size.

Model Fit

Linear regression models can be assessed by how much of the variance in the response they account for. We can calculate this with the \(R^2\) statistic mentioned earlier.

Code
variance_table = matrix(nrow = 1, ncol = 3)
colnames(variance_table) = c("Variance of Response",
                             "Variance of Fitted Values",
                             "Variance of Residuals")
variance_table[1, 1] = var(joined_data$Life_Expectancy)
variance_table[1, 2] = var(logmodel$fitted.values)
variance_table[1, 3] = var(logmodel$residuals)
variance_table[1,] = round(variance_table[1,], digits = 2)
kable(variance_table) |>
  kable_material_dark()
Variance of Response Variance of Fitted Values Variance of Residuals
97.63 59.34 38.29

\[R^2 = 1-\frac{38.29}{97.63} = \frac{59.34}{97.63} = 0.608\] The variance in \(lGDP\) accounts for 60.8% of the variance in \(Life Expectancy\). This is a moderate \(R^2\), so this model does a decent job of explaining the response. Considering how many factors may influence \(Life Expectancy\) across countries and time, this is a good quality model.

Estimated Model

The following model was obtained by minimzing the squared residuals of a linear model fitted to the data. This process is called OLS (Ordinary Least Squares.) \[\widehat{Life Expectancy} = 22.84 + 5.312lGDP\] The logarithmic transformation of GDP per capita makes the linear model fit quite well. This model predicts that if GDP per capita increases by 1%, mean \(LifeExpectancy\) increases by approximately 0.05312 years. At a GDP per capita of 1 dollar, mean \(LifeExpectancy\) is 22.84 years.

Changes Over Time

The regression model we estimated combines data for the years 1959 through 2019. Since so much has changed over the course of those 60 years, we estimated a separate OLS model for each year to look at the changes over time.

Code
years = 1959:2019
lms_year = matrix(nrow = length(years), ncol = 2)
lms_year[, 1] = years
coef_year = matrix(nrow = length(years), ncol = 3)
coef_year[, 1] = years
for (y in years) {
  year_data = joined_data |>
    filter(Year == y)
  year_model = lm(Life_Expectancy ~ lGDP, data = year_data)
  lms_year[y - 1958, 2] = summary(year_model)$r.squared
  coef_year[y - 1958, 2] = year_model$coefficients[1]
  coef_year[y - 1958, 3] = year_model$coefficients[2]
}
lms_year = data.frame(lms_year)
colnames(lms_year) = c("Year", "Rsquared")
coef_year = data.frame(coef_year)
colnames(coef_year) = c("Year", "Beta0", "Beta1")

ggplot(data = lms_year, aes(x = Year, y = Rsquared)) +
  geom_line(color = "steelblue") +
  ylim(0, 1) +
  labs(y = "",
       subtitle = "R-Squared",
       title = "R-Squared of ln(GDP per capita) and Life Expectancy over time")

While variation in \(lGDP\) has consistently explained about 60% of variation in \(LifeExpectancy\), the nature of this relationship has changed significantly over time. There have also been dips in the \(R^2\) corresponding to outliers in the data.

Code
library(patchwork)
beta0 = ggplot(data = coef_year, aes(x = Year, y=Beta0)) +
  geom_line(color = "steelblue") +
  labs(y = "",
       subtitle = "Estimated Intercept",
       title = "Estimated Intercept Over Time")

beta1 = ggplot(data = coef_year, aes(x = Year, y=Beta1)) +
  geom_line(color = "steelblue") +
  labs(y = "",
       subtitle = "Estimated Slope",
       title = "Estimated Slope of ln(GDP per capita) Over Time")
beta0+beta1

The left graph shows how the mean \(LifeExpectancy\) when GDP per capita is 1 dollar has increased dramatically over time. The right graph shows that while a 1% increase in GDP per capita increased \(LifeExpectancy\) by almost 0.07 years in 1959, it now does so by only around 0.04 years.

Differences Between Countries

There are no countries in the data that have suddenly increased their life expectancy without increasing their GDP, resulting in a high positive residual. There are, however, several low outliers in the data that result from countries going through traumatic events that decrease life expectancy. These low outliers can be seen on the interactive scatterplot below.

Code
library(plotly)
plot_ly(
  joined_data,
  x = ~ lGDP,
  y = ~ Life_Expectancy,
  text = ~ paste("Country: ", country, '<br>Year:', Year)
)

Low Outliers

Code
countries = joined_data |>
  select(country) |>
  distinct()

lms_country = matrix(nrow = length(countries), ncol = 2)
lms_country[, 1] = countries
lms_country = data.frame(lms_country)
colnames(lms_country) = c("Country", "Rsquared")
idnum = 1
for (c in countries$country) {
  country_data = joined_data |>
    filter(country == c)
  country_model = lm(Life_Expectancy ~ lGDP, data = country_data)
  lms_country[idnum , "Rsquared"] = summary(country_model)$r.squared
  idnum = idnum + 1
}
ggplot(data = lms_country, aes(x = Rsquared)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue") +
  labs(
    x = "R-squared",
    y = "",
    subtitle = "Count",
    title = "Histogram of Countries by R-Squared of ln(GDP per capita) and Life Expectancy"
  )

While the relationship between \(lGDP\) and \(Life Expectancy\) is quite strong for most countries, for some countries it is much weaker. This is mostly due to events that decrease life expectancy such as epidemics, famines, natural disasters and wars. Botswana for example had a linear relationship between \(lGDP\) and \(Life Expectancy\) until the AIDS crisis hit. This drop in \(Life Expectancy\) while \(lGDP\) continued to grow is responsible for its \(R^2\) of 0.0013.

Code
outliers <- joined_data |>
  filter(abs(residuals) > 22) |>
  mutate(lGDP = round(lGDP, digits = 2),
         residuals = round(residuals, digits = 2))
kable(
  outliers,
  col.names = c(
    "Year",
    "Country",
    "GDP per Capita",
    "Life Expectancy",
    "ln(GDP per Capita)",
    "Residual"
  )
) |>
  kable_material_dark("hover")
Year Country GDP per Capita Life Expectancy ln(GDP per Capita) Residual
1959 Cameroon 932 27.5 6.84 -31.59
1959 China 238 27.8 5.47 -24.05
1960 Cameroon 924 26.4 6.83 -32.64
1971 Burundi 331 17.1 5.80 -36.50
1993 Rwanda 216 9.5 5.38 -41.84
1999 Botswana 4570 45.5 8.43 -22.02
2000 Botswana 4490 45.0 8.41 -22.42
2001 Botswana 4680 44.9 8.45 -22.74
2002 Botswana 4820 45.7 8.48 -22.10
2009 Haiti 1300 32.5 7.17 -28.35

These low residuals are associated with several significant events, including the Cameroonian War of Independence and resulting Civil War, the Great Leap Forward and subsequent famine, the 1993 Rwandan Genocide and the AIDS crisis in Southern Africa (particularly in Botswana.) We are unsure about the cause of the dips in life expectancy in Haiti in 2009 and Burundi in 1972. It would make more sense if life expectancy dipped in 2010 in Haiti due to a 7.0 magnitude earthquake and Burundi in 1973 due to a genocide of Hutus. There may be a quirk in the data collection process. According to the authors of the dataset, some of the data points are “rough guesstimates.”

Countries with a Weak Relationship

Code
joined_data |>
  filter(country == "Botswana") |>
  ggplot(aes(x = lGDP, y = Life_Expectancy)) +
  geom_point() +
  labs(
    x = "ln(GDP per capita)",
    y = "",
    subtitle = "Life Expectancy",
    title = "Life Expectancy and ln(GDP per capita) in Botswana"
  ) +
  geom_smooth(method = "lm")

Botswana had an approximately linear relationship between \(lGDP\) and \(LifeExpectancy\) until the AIDS crisis hit.

Code
lms_country |>
  mutate(Rsquared = round(Rsquared, digits = 3)) |>
  slice_min(order_by = Rsquared, n = 8) |>
  kable() |>
  kable_material_dark("hover")
Country Rsquared
Liberia 0.000
Botswana 0.001
Mauritania 0.002
Yemen 0.002
Gambia 0.003
Sierra Leone 0.004
Central African Republic 0.015
Solomon Islands 0.018

Countries that have suffered from natural disasters and prolonged civil wars make up most of the countries with weak relationships.

Countries with a Strong Relationship

Code
joined_data |>
  filter(country == "Germany") |>
  ggplot(aes(x = lGDP, y = Life_Expectancy)) +
  geom_point() +
  labs(
    x = "ln(GDP per capita)",
    y = "",
    subtitle = "Life Expectancy",
    title = "Life Expectancy and ln(GDP per capita) in Germany"
  ) +
  geom_smooth(method = "lm")

In countries without life expectancy-decreasing events since 1959, there can be a very strong linear relationship. In the case Germany, the \(R^2\) is 0.983.

Code
lms_country |>
  mutate(Rsquared = round(Rsquared, digits = 3)) |>
  slice_max(order_by = Rsquared, n = 8) |>
  kable() |>
  kable_material_dark("hover")
Country Rsquared
South Korea 0.983
Germany 0.983
Sao Tome and Principe 0.983
Lao 0.980
Colombia 0.978
Hong Kong, China 0.978
Djibouti 0.978
Egypt 0.974

Simuluation

We are going to do two simulations to validate our model.

Simulated Distribution

First, we generated a simulated distribution by taking each of the fitted values and adding noise based on the standard error of the residuals.

Code
set.seed(1738)
sigma = sigma(logmodel)
predictions = predict(logmodel)
simulation <- function(x, rse) {
  x <- x + rnorm(n = length(x), mean = 0, sd = rse)
  return(x)
}
actualhist = ggplot(data = joined_data, aes(x = Life_Expectancy)) +
  geom_histogram(binwidth = 2, fill = "steelblue") +
  labs(
    x = "Life Expectancy",
    y = "",
    subtitle = "Count",
    title = "Observed Distribution"
  )
simhist = ggplot(data = joined_data, aes(x = simulation(predictions, sigma))) +
  geom_histogram(binwidth = 2, fill = "steelblue") +
  labs(
    x = "Simulated Life Expectancy",
    y = "",
    subtitle = "Count",
    title = "Simulated Distribution"
  )
actualhist + simhist

Because of the left skew of the data, the simulated distribution doesn’t exactly match the actual distribution. Due to the high sample size (9162 observations,) this isn’t a huge issue as far as fitting the model.

Multiple Predictive Checks

Second, we generated 2000 simulated data sets using the method above and found the \(R^2\) of each one. This shouldn’t be compared to the \(R^2\) of 0.608 found earlier.

Code
simulate <- function() {
  x = predictions + rnorm(n = length(joined_data$lGDP),
                          mean = 0,
                          sd = 6.188)
  simmodel = lm(joined_data$Life_Expectancy ~ x)
  return(summary(simmodel)$r.squared)
}
ggplot(data = NULL, aes(x = map_dbl(1:2000, ~ simulate()))) +
  geom_histogram(fill = "steelblue") +
  labs(
    x = "Simulated R-squared of Life Expectancy and Simulated ln(GDP per capita)",
    y = "",
    subtitle = "Count",
    title = "R-squared Simulation"
  )

Conclusion

After exploring the data and fitting a model, we can conclude that there is a moderate positive relationship between ln(GDP per capita) and life expectancy. This relationship has held over time, although it is sometimes thrown off by traumatic events.